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Abstract 

Monte Carlo (MC) methods for numerical integration seem to be embarassingly par- 
allel on first sight. When adaptive schemes are applied in order to enhance convergence 
however, the seemingly most natural way of replicating the whole job on each processor 
can potentially ruin the adaptive behaviour. Using the popular VEGAS- Algorithm as 
an example an economic method of semi-micro parallelization with variable grain-size is 
presented and contrasted with another straightforward approach of macro-parallelization. 
A portable implementation of this semi-micro parallelization is used in the xloops-project 
and is made publicly available. 

keywords: parallel computing, grain-size, Monte Carlo integration, Tausworthe, GFSR. 

Program Summary 

Title of program: pvegas . c 

Computer and operating system tested: Convex SPP1200 (SPP-UX 4.2), intel x86 (Linux 2.0 
with SMP), DEC Alpha (OSF1 3.2 and Digital Unix), Sparc (Solaris 2.5), RS/6000 (AIX 4.0) 

Programming language: ANSI-C 

No. of lines in distributed routine: 530 



1 Introduction 

The Monte Carlo Method frequently turns out to be the only feasible way to get numerical 
results of integrals when ill-behaved integrands are involved of which no a priori-knowledge 
about their behaviour is available. It not only handles step functions and gives reliable 
error estimates but also has the desireable feature that the rate of convergence is dimension- 
independent. In the framework of the xloops-project [||, ||, for instance, massive two-loop 
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Feynman-diagrams with exterior momenta are calculated analytically as far as possible with 
sometimes very ill-behaved integrals left for numerical evaluation over finite two- or three- 
dimensional volumes. 

If a function / needs to be integrated over a D-dimensional volume f2, one can evaluate 
f{x) over N random sample-points Xi with i € {1 . . . N} and compute the estimate 

s m := Bj2f( Xi )^ f f( X )dx, (i) 



which has a convergence-rate of 1/y/N—l for large N. Similarly, 

has basically the same behaviour, if the probability density p(x) is normalized to unity in fi: 

/ p(x) dx = 1. 
Jn 

The introduction of the weight-function p is equivalent to a transformation in the integration 
variables 

/ f(x)dx= [ f(P(y)) ^ dy 

where the transformation leaves the boundary d£l unchanged. 

The adaptive Monte Carlo Method now tries to effectively improve the rate of convergence 
by choosing p(x) properly. As is well known, the modified variance a for large ./V is given by: 



^(sf-tS'") 2 ) (3) 



with 



r-(2) = f f( X i) 

P N 2^ 



N ^r 1 \p(x 



The Central Limit Theorem implies that for square integrable f(x) the distribution of 
around the true value becomes Gaussian and a in (||) is a reliable error-estimate. As every 
method of selecting a proper p must rely on information about the integrand, only approx- 
imate and/or iterative methods are practically in use. The popular vegas-algorithm uses 
two of these. We will sketch them in a brief discussion of vegas in Section |2[ Section || 
contains some warnings about a sometimes seen oversimplified macro-parallelized vegas and 
in Section || our approach is presented together with some real-world measurements of effi- 
ciency. At some places explicit variable-names are mentioned for those readers familiar with 
G. P. Lepage's original code 0. 



2 About VEGAS 

The two techniques used by vegas to enhance the rate of convergence are importance sampling 
and stratified sampling. Importance sampling tries to enhance a weight-function p(x) by 
drawing from previous iterations. It is well known, that the variance a is minimized, when 



p{x) = \f{x)\l J Q \f(x)\dx. 
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This method concentrates the density p where the function is largest in magnitude. Stratified 
sampling attempts to enhance the iV -1//2 -behaviour of MC integration by choosing a set of 
random numbers which is more evenly distributed than plain random numbers are. (Recall 
that the simplest method of stratification would evaluate the function on a Cartesian grid 
and thus converge as N~ l .) This is done by subdividing the volume Q into a number k of 
hypercubes {Gi},i £ {1 . . . k} and performing an MC integration over N/k sample-points 
in each. The variance in each hypercube can be varied by shifting the boundaries of the 
hypercubes between successive iterations (Figure |l]a shows an initial grid, |l|b the grid at a 
later stage). The optimal grid is established when the variance is equal in all hypercubes. 
This method concentrates the density p where both the function and its gradient are large in 
magnitude. The split-up of 0, into k hypercubes turns out to be the key-point in efficiently 
parallelizing vegas. 

The way vegas iterates hypercubes across the whole volume is designed in a dimension- 
independent way. In effect it just amounts to D loops packed into each other, each iterating 
from the lower limit of integration to the upper one. We'll see in section |J] how this looping 
can be exploited for parallelization with variable grain-size. For a more thorough discussion 
of vegas the reader is referred to the literature |4|, [3|, || . 



a) b) 




Figure 1: The adaptive behaviour of vegas. Equal numbers of points (traditionally, the 
variable npg counts them) are sampled in each hypercube. 

3 Macro-Parallelization 

The most straightforward approach to make use of a parallel machine with p processors is to 
simply replicate the whole job. Instead of having one processor calculate N sample-points, p 
instances of the integrator ("workers") are started, each sampling N/p points. Subsequently, 
the results from each processor are averaged taking into account their differing error-estimates. 
We call this approach macro-parallelization. 

It is immediately clear that this is trivial to implement and usually results in good perfor- 
mance since the amount of communication among processors is minimized. This approach, 
however, results in p different grids, each less fine than the original one. If the same number of 
points is sampled in each hypercube and the overall number of points are equal, the amount 
by which the grid will be coarser is given by the dilution of hypercubes which is ~ p^ 1 . Fur- 
thermore, in an extreme situation some of the workers might accidentally miss an interesting 
area and return wrong results way outside the estimated error-bounds and thus completely 
fail to adapt. 
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We have seen realizations of a slightly improved method which does not suffer from over- 
estimation of single partial results. This method spawns p workers, again each evaluating 
N/p points, and lets each evaluate the cumulative variables^ and send them to the parent 
which adds them and subsequently computes the new grid. This method will still suffer from 
coarse grids but it will adapt more cleanly. In effect, it amounts to synchronizing the grids 
between workers. 

Table |l] exemplifies the problems typical for macro-parallelization. It shows the results 
of an integration over a unit-square. The test-function was a narrow Gaussian peak with 
width 10 -3 and normalized such that the exact result of the integration is unity. All runs 
were typical vegas-calls: The first 10 iterations were used only to refine the grid, their results 
were discarded (entry-point 1 in vegas-j argon). In that particular case the macro-parallelized 
version took 5 iterations until every processor had "detected" the peak. The ones that failed 
to detect it returned very small values with small error-bounds and the common rules for 
error-manipulation then grossly overestimated the weight of these erroneous results. The 
unparallelized version in contrast was able to adapt to the function's shape very early. The 
last 5 iterations were cumulative: each iteration inherited not only the grid but also the 
result of the previous one (entry-point 2). Note also that after the grids have adapted to 
the situation, the macro-parallelized vegas without synchronization still returns misleading 
error-bounds. 



Table 1: 15 iterations of two macro-parallelized vegas (with p = 16) integrating a sharp 
Gaussian contrasted with an unparallelized run. Equal numbers of function calls were sampled 
in each run. 



calls: 



macro-parallelized: 



• io- 39 

• 1(T 35 

• io- 34 

. 10-20 
0.998361 
0.996855 
0.997329 
0.993738 
0.992781 



macro-parallel, (sync. 



unparallelized: 



5 000 • 16 



4.2 
1.2 
1.9 
6.0 
5.7 



±2.2 
±3.5 
± 5.0 
± 5.0 
±5.7 



10 



1Q -33 
1Q -33 
1Q -33 
1Q -20 



± 0.000492 
± 0.000635 
±0.001107 
± 0.002539 
± 0.001604 



0.000086 ± 0. 
1.032088 ± 0. 
1.000428 ± 0. 
0.999933 ± 0. 
0.998949 ± 0. 
1.000323 ± 0. 
1.000582 ± 0. 
0.997636 ± 0. 
0.997168 ± 0. 
0.998939 ± 0. 



000069 
041060 
000283 
000143 
000187 
000980 
000676 
000311 
000160 
001081 



0.537686 
0.993894 
1.000008 
0.999724 
1.001166 
0.999365 
1.000288 
1.000052 
1.000017 
1.000019 



± 0.537683 
± 0.013399 
± 0.000067 
± 0.000080 
±0.001102 
± 0.000603 
± 0.000295 
± 0.000127 
± 0.000088 
± 0.000068 



20 000 • 16 



0.993782 
0.993844 
0.994918 
0.996218 
0.997030 



±0.001661 
± 0.001446 
±0.001192 
±0.001213 
± 0.001070 



0.999631 ± 0. 
0.999705 ± 0. 
0.999693 ± 0. 
0.999611 ± 0. 
0.999720 ± 0. 



000827 
000355 
000325 
000316 
000255 



1.000016 
0.999937 
0.999940 
0.999957 
0.999995 



± 0.000057 
± 0.000046 
± 0.000040 
± 0.000035 
± 0.000021 



The macro-parallelized version with grid-synchronization performs better than the one 
without but still is less able to adapt to the specific integrand, as expected. 

Of course the results of this extreme situation are less pronounced for better-behaved 



1 The cumulative variables are the real scalars traditionally called ti and tsi as well as the real arrays d 
and di. 
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integrands but the general result always holds. It is just a manifestation of a fact well-known 
to people using vegas: Few large iterations generally result in better estimates than many 
small ones. 

4 Semi-Micro-Parallelization 

What is desired is a method that parallelizes the algorithm but still has the same numerical 
properties as the sequential version. As has been shown in the previous section, this cannot 
be achieved on a macroscopic level. Fortunately, vegas does offer a convenient way to split up 
the algorithm and map it onto a parallel architecture. Still using a well-understood farmer- 
worker-model, our approach tries to distribute the hypercubes that make up the domain of 
integration to the workers. 

MC integration does not exhibit any boundaries which would need extensive communica- 
tion among workers but it does need some accounting-synchronization to guarantee that each 
hypercube is evaluated exactly once. A straightforward broadcast-gather-approach would re- 
quire one communication per hypercube and would thus generate an irresponsible amount 
of overhead spoiling efficent scalability. We therefore suggest having each processor evaluate 
more than one hypercube before any communication is done. Let r be the number of equal 
fractions the whole volume is split up into. Ideally, we should require the number r of fractions 
to be much smaller than the number k of hypercubes {Gj}: r <C k. 

The problem of dynamic load-balancing can in practice be solved by making the number 
of fractions r much larger than the number of processors p^\ p <C r. 

We thus arrive at the constraint: 

p < r < k. (4) 

This inequality can be satisfied in the following convenient way opening up an algo- 
rithmically feasible way to implement it: Projecting the D-dimensional volume onto a Dn- 
dimensional subspace defines a set of Dn -dimensional sub-cubes. The set of original hyper- 
cubes belonging to the same sub-cube make up one fraction to be done by one worker in 
a single loop. We thus identify r with the number of sub-cubes. Because the hypercubes 
belonging to different sub-cubes can be evaluated concurrently we call the Dm -dimensional 
subspace the parallel space and its orthogonal complement the Dj_-dimensional orthogonal 
space (D = Dn +Dj_). Choosing Dn = [D/2\ and D± = \D/2] can be expected to satisfy @ 
for practical purposes (Figure ||). 

4.1 Random Numbers 

An important issue for every MC-effort is the random number generator (RNG). There are 
two different ways to tackle the problems arising in parallel simulations |pXf| : 

• One single RNG pre-evaluates a sequence of random numbers which are then assigned 
without overlap to the processors. 

2 One might argue that letting r be an integer multiple of p would fit best on p nodes. However, this is only 
valid if one assumes that the function exhibits the same degree of complexity in the whole volume and if each 
node is equally fast. Both assumptions are usually unjustified. 
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Figure 2: Example for a parallelization of vegas with D = 3, Du =1 and D± = 2. (The 
shaded volume is done by one processor.) 

• Every processor gets a RNG of its own and some method has to guarantee that no 
correlations spoil the result. 

A look at Amdahl's Law shows that the second approach is the more attractive one. Amdahl's 
Law relates the speedup S for parallel architectures with the number of processors p and the 
fraction of code a which is executed in parallel: 

S = ((l-a)+p/a)- 1 . (5) 

Use of concurrently running RNGs increases a which in turn results in an improved speedup. 

Most compiler libraries provide linear congruential generators which generate sequential 
pseudorandom numbers by the recurrence 

Xi + i := (aXi + c) mod m (6) 

with carefully selected a, c and m. Because of their short period and long-range correlations 
reported by De Matteis and Pagnutti || which make parallelization dangerous this type is 
not suited for large MC-simulations. 

For our case we therefore decided to build on a slightly modified shift register pseudoran- 
dom number generator (SR) |7], [8|, ^, This widely employed class of algorithms (R250 is 
one example) generates random-bit sequences by pairwise XORing bits from some given list 
of P binary numbers Xq,Xi, . . . xp-\: 

x k := x k _ P © x k _ P+ Q (k > P) (7) 

Here, © represents the exclusive-or (XOR) operator and P and Q are chosen such that the 
trinomial 1 + x p + x® is primitive modulo two. The so defined 'Tausworthe-sequence' |7| is 
known to have periodicity 2 P — 1. Thus, every combination of P bits occurs exactly once with 
the only exception being P subsequent zeros (which would return the trivial sequence of zeros 
only, if it occured) . Tables of "magic-numbers" P and Q can be found in the literature || [l0| 
and are provided with our program-sources. Note that owing to its exponential growth with 
P, the periodicity easily reaches astronomical lengths which can never be exploited by any 
machine. 

Uniformly distributed random L-bit integers can now be constructed easily by putting 
together columns of bits from several instances of the same Tausworthe-sequence {xi} with 
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predefined delays d n € {0 . . . 2 P — 1}: 

L-l 

X i = 1 £x i+dn 2 n . (8) 

n=0 

Floating-point numbers in the range [0, 1) can subsequently be computed by dividing Xi by 
2 L . In the continuous limit of large L such a random-sequence will have mean X = 1/2, 
variance a 2 = 1/12 as well as the enormous length of the original bit-sequences which in 
turn guarantees -D-space uniformity for D < P/L. In addition, this method is extremely fast 
because the machine's word-size and XOR-operation from the native instruction-set can be 
exploited. 

The good properties of this class of generators can be ruined by improper initialization. 
Lewis and Payne Q for instance, suggested initializing the Tausworthe-sequence with every 
bit set to one, introduce a common delay d = 100P between each column of bits and throw 
away the first 5000P iterations in order to leave behind initial correlations. This is not only 
slow (even if a short-cut described by I. Deak [1C] is used), but also results in perspicuous 



correlations if P only becomes large enough. This is a direct result of the exponential growth 
of the period while the delay and initial iterations grow only linearly. 

A quicker and less cumbersome initialization procedure was suggested by Kirkpatrick 
and Stoll ||. They noted that initializing the Tausworthe-sequence with random-bits from 
some other generator, will define an (unknown) offset somewhere between and 2 P — 1 in 
the sequence (0) from which iteration can proceed. Initializing every column of bits in the 
integer-sequence (Q) with such random-numbers defines different offsets and thus implicitly 
defines a set of delays {d n } as well as the starting-point of the whole sequence. This method 
does clearly not suffer from initial correlations. 

The Method of Kirkpatrick and Stoll offers a clean and efficient way for parallelization: 
As many generators as there are processors can be initialized by random numbers from some 
generator, for example a simple and well- understood linear congruential one. Only the X n , n £ 
{0 . . . P— 1} of each of the p generators need to be filled. The probability that two of these 
generators will produce the same sequence because they join the same set of delays {d n } can 
be made arbitrary small by simply choosing P big enough. 

To rule out correlations among the p sequences is equivalent to assuming there are no 
interactions between the shift-register generator and the linear congruential generator. Indeed, 
the methods and the underlying theory are quite different. The method is however still 
plagued by the known flaws, common to all shift register generators. One examples is the 
triplet-correlation [11 1. It can in principle be cured by an expansion of the method described 



in [12J. In the case of vegas however, we see no reason why high-quality RNGs should be 
needed at all and we therefore advocate using simple generators with P > 607: Stratification 
lets short-range-correlations only take effect within the hypercubes inside the rectangular grid 
where very few points are sampled and long-range-correlations become washed out by the grid 
shifting between iterations. This view is supported by the observation that correlations in 
SR-generators seem to have been discovered only in calculations more sophisticated than plain 
MC integration 0, 0, [14 1 . 



4.2 Evaluation 

Figure shows the efficiency at integrating a function consisting of the sum of 8 Dilogarithms 



computed with a method suggested in 15]. The parameters have been chosen such that all the 
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characteristic properties become visible in one single run. The five-dimensional volume was 
split up into a two-dimensional parallel space and a three-dimensional orthogonal space with 
each axis subdivided into 21 intervals. 2 • 21 5 ~ 8 • 10 6 points were evaluated in each iteration. 
What we see are some minor fluctuations modulated on a rather good overall efficiency. The 
SPP1200 consists of hypernodes with 8 processors running in real shared memory each, hence 
the drop-off at p ~ 8 where the second hypernode across an interconnect is first touched. 
The behaviour for small p is thus machine-specific. The sawtooth for larger p, in contrast, is 
characteristic for the algorithm: As the test function does not involve steps or other changes in 
cost of evaluation, most processors terminate the job assigned to them rather simultaneously. 
So, at p = 40 we see each processor evaluating 11 of the w = 21 2 = 441 fractions and then one 
processor evaluating the single remaining one. The algorithm thus needs 12 times the time 
necessary for evaluating one fraction while at p = 41 • • • 44 it needs only 11. This behaviour 
can easily be stopped by raising the dimension of the parallel space to three for instance, thus 
decreasing the grain-size. The obvious drawback is an incremented communication-overhead. 
The ideal split-up has to be determined individually for each combination of hardware and 
problem. For a given p, astute users will probably tune their parameters N and Dy judiciously 
in order to take advantage of one of the peaks in Figure ||. 

Efficiency on a 48-Processor Convex SPP1200 
^vZ^T H I I 5-5 problem — — 



0.95 



0.9 



0.85 



16 24 32 

number of processors p 



40 



Figure 3: Efficiency of a semi-micro parallelized version of vegas on the Convex Exemplar 
architecture. (See the article for a discussion of this curve.) 



5 Conclusion and availability 

We have shown, that for ill-behaved test functions in adaptive MC integrators it is essential to 
use large sets of sample-points at a time. Under these circumstances a macro-parallelization 
is not satisfying stringent numerical needs. For the xloops project 0, §], we have developed 
a version of vegas which does parallelization on a smaller scale and has the same numerical 
properties as the original one. For D > 4 the grain-size of the algorithm becomes a parameter. 
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The algorithm can be used as a complete drop-in replacement for the common vegas. It 
is currently being used in xloops, where it does the last steps in integrating massive 2-loop 
Feynman diagrams. 

A portable implementation in ANSI-C of the outlined algorithm running on every modern 
SMP-Unix (either featuring Pthreads ||16|| , Draft 4 Pthreads or CPS-threads) can be found 
at f tp : / /higgs . physik . uni-mainz . de/pub/pvegas/| . Hints on how to use it can be found 
at the same place. Using the strutures outlined above, it should be easy to implement a 
mpivegas running on machines with distributed memory. Upon demand, we can provide 
such a routine, using the MPI message-passing standard 17]. 
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